ABSTRACT

Some wavelet-based methods for signal estimation in the presence of
noise are reviewed in the context of the parsimonious representation
of the underlying signal. Three approaches are considered. The first
is based on the application of the MDL principle. The robustness of
this method is improved in the second approach, by relaxing the
assumption of known noise distribution following Huber’s work. In the
third approach, a Bayesian strategy is adopted in order to incorporate
prior information pertaining to the signal of interest; this method is
especially useful at low signal-to-noise ratios.

1.  INTRODUCTION 

Model parsimony has been of growing interest to researchers in recent
years, motivated by factors as diverse as storage in computer memory,
computational efficiency, and communication.

The proposed techniques are many, each entailing heuristics and
allowing interpretations proper to a particular application. As a
result, the common theme uniting these different approaches sometimes
seems hopelessly inaccessible. Nevertheless, it is possible to cast
many of these application-specific methodologies as problems of
“regularization”.

It is often desired to limit the number of degrees of freedom in
inverse problems by assuming a prior and thereby mitigating their
ill-posedness. In pattern recognition, one is typically interested in
the most parsimonious model that captures whatever information in the
data is deemed essential, while a penalty for model mismatch plays the
role of a prior for model parameters.

In this paper, we discuss several wavelet-based methods for signal
estimation in the presence of noise, within the context of the
parsimonious representation of the underlying signal. We show in
particular that Rissanen’s Minimum Description Length (MDL) principle
can be applied to wavelet reconstructions to determine the complexity
of the signal representation, i.e. to choose which coefficients to
include in the reconstruction, and which to dismiss as noise.

The paper is organized as follows: Section 2 presents the problem
statement. In Section 3, we highlight the importance of model
parsimony to the signal denoising problem. In Section 4, we describe
two information-theoretic methods for signal estimation via MDL.
Finally, in Section 5, we discuss a statistical approach that permits
the introduction of prior information on the signal of interest when it
is embedded in high-intensity noise.

2.  PROBLEM  STATEMENT  AND  NOTATION 

The  estimation  problem  of  interest  in  this  paper  assumes  the 
following  observation  model: 

$$x(t) = s(t) + n(t), \tag{1}$$

where  s(t)  is  an  unknown  signal  corrupted  by  the  zero-mean 
noise  process  n(t). 

The underlying signal is modeled as an orthonormal basis
representation,

$$s(t) = \sum_i C_i^s \psi_i(t),$$

which in turn leads to the working model

$$C_i = C_i^s + C_i^n, \quad i \in \{1, \ldots, K\}, \tag{2}$$

where $\psi_i$ denotes the $i$-th basis function and $C_i$ are the
corrupted coefficients. In many cases, the noise coefficients $C_i^n$
can be assumed independent; they share the same second-order
statistical properties as $n(t)$ when the latter is a white noise
sequence. Our problem is to recover or reconstruct $s(t)$ from the
orthogonal transform of the observed process $x(t)$.


3.  PARSIMONY  AND  DENOISING 

The unitary transformation of a process afforded by the wavelet
decomposition provides a complete statistical characterization of that
process in the transform domain. The fact that the properties of the
underlying signal and of the contaminating noise are well
characterized, together with the orthogonality of the transform (which
maximally removes any redundancy), suggests the potential efficiency
of this approach for the statistical separation of signal and noise.
An additional feature of this transformation, which in many cases
turns out to be crucial, is the property of vanishing moments of the
basis functions. This property tends to concentrate the signal energy
into very few dimensions. If the noise is white, then a subset of the
dimensions will represent mostly signal, and the identification of
this subset is very reminiscent of the model order identification
problem, where space is partitioned into what might be referred to as
the signal subspace and the noise subspace,

$$C = C^s \oplus C^n. \tag{3}$$
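
As a rough illustration of the working model (2) and of the energy
concentration described above, the following Python sketch applies an
orthonormal Haar transform to a noisy piecewise-constant signal. The
choice of the Haar basis, the test signal, the noise level and all
function names are assumptions made for this example only; they are
not taken from the paper.

```python
import math
import random

def haar(x):
    """Orthonormal Haar transform of a sequence whose length is a power of two."""
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        approx = [(out[2 * k] + out[2 * k + 1]) / math.sqrt(2.0) for k in range(half)]
        detail = [(out[2 * k] - out[2 * k + 1]) / math.sqrt(2.0) for k in range(half)]
        out[:n] = approx + detail  # coarse part first, details after it
        n = half
    return out

def energy(v):
    """Sum of squares of a sequence."""
    return sum(c * c for c in v)

random.seed(0)
K = 64
s = [1.0 if 16 <= t < 40 else 0.0 for t in range(K)]  # hypothetical test signal
n = [random.gauss(0.0, 0.3) for _ in range(K)]         # white Gaussian noise
x = [si + ni for si, ni in zip(s, n)]                   # observation model (1)

C, Cs, Cn = haar(x), haar(s), haar(n)

# Linearity of the transform: C_i = C_i^s + C_i^n, as in the working model (2).
max_gap = max(abs(c - (cs + cn)) for c, cs, cn in zip(C, Cs, Cn))

# Parsimony: count the signal coefficients that are not (numerically) zero.
nonzero_signal = sum(1 for c in Cs if abs(c) > 1e-9)

print(max_gap, nonzero_signal, energy(n), energy(Cn))
```

In this example the 24-sample pulse is carried by 5 of the 64 Haar
coefficients, while the orthonormal transform leaves the noise energy
unchanged and spread over all coefficients.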

This subspace identification can also be carried out objectively
through the likelihood of $C$. A model prior on the parameters must
now be assigned first to reduce the class of possible models, and to
account for model mismatch [7]:

$$\mathcal{L}(C, K, P) = -\log p(C \mid C^s) + a(K, P),$$

where $K$ and $P$ are respectively the data length and the number of
signal dimensions. Rissanen refers to $\mathcal{L}$ as the description
length which, upon minimization, represents the coding length of the
observed series $\{C_i\}$. This coding parsimony, together with the
model summarizing the pertinent information underlying the process,
forms the basis of an interesting methodology developed over the last
few years and retraced below with the rationale and hindsight afforded
by time.

4.  AN  INFORMATION-THEORETIC  APPROACH 

In what follows, we assume that the underlying signal $s(t)$ is a
deterministic but unknown signal in $L^2(\mathbb{R})$. For the
contaminating noise, we consider two cases:

•  The probability density function of the contaminating noise is
assumed to be known.

•  The probability density function of the contaminating noise is
unknown, but belongs to a known class; the worst-case scenario is
sought within a minimax framework [8].

4.1. Coding for Denoising

The property of wavelets of concentrating energy into relatively few
coefficients, and their inability to achieve that with noise,
simplifies our formulation of the denoising problem as one of
compression. The efficiency of the resulting solution is qualitatively
and quantitatively reflected by the MDL, whose rationale is to seek
and determine the shortest coding length of a data sequence
$\{C_i\}_{1 \le i \le K}$ which best summarizes the relevant
information embedded in the observed process. Recall that the
coefficients are assumed independent. It then follows that their joint
probability density function (pdf) is

$$\exp\left(-\sum_{i=1}^{K} \varphi_i(C_i - C_i^s)\right),$$

where $\varphi_i$ is a known “potential” function. For instance, by
choosing

$$\varphi_i(u) = \left|\frac{u}{\gamma_i}\right|^{\beta_i} - \log\left(\frac{\beta_i}{2\gamma_i\Gamma(1/\beta_i)}\right), \tag{4}$$

an exponential-power distribution is obtained with
$(\beta_i, \gamma_i) \in (\mathbb{R}_+^*)^2$. The Gaussian
distribution corresponds to $\beta_i = 2$ and the Laplacian
distribution to $\beta_i = 1$. Note that different functions
$\varphi_i$ can be chosen so as to take into account, for example,
different statistics of the noise at each scale. The above pdf can be
viewed as a function $p(C_1, \ldots, C_K \mid \zeta)$ where the
parameter vector is given by

$$\zeta = (i_1, \ldots, i_P, C_{i_1}^s, \ldots, C_{i_P}^s), \tag{5}$$

$P$ being the number of “principal directions” of the sequence
$\{C_i^s\}_{1 \le i \le K}$, which is assumed to satisfy

$$C_{i_\ell}^s \neq 0 \quad \text{iff} \quad 1 \le \ell \le P. \tag{6}$$

The unknown parameters are the $P$ coefficients
$\{C_{i_\ell}^s\}_{1 \le \ell \le P}$ and their respective locations
$\{i_\ell\}_{1 \le \ell \le P}$, for which one could search the
maximum of the likelihood hypersurface. While a direct and naive
approach of maximizing the likelihood function would generally
maximize $P$, the solution provided by the MDL criterion attaches a
regularizing penalty to lead to

$$\mathcal{L}(C_1, \ldots, C_K, \zeta, P) = -\log p(C_1, \ldots, C_K \mid \zeta) + \frac{1}{2}(2P)\log K. \tag{7}$$

Proposition 1. If the functions $\varphi_i$ are such that

$$\forall u, \quad \varphi_i(u) \ge \varphi_i(0),$$

the $P$ coefficients $C_{i_1}^s, \ldots, C_{i_P}^s$ which, based upon
the MDL method, give the optimal coding length of $x(t)$, are
determined by the components $C_i$ which satisfy the following
inequality:

$$\varphi_i(C_i) > \log(K) + \varphi_i(0).$$

In the exponential-power case, the above inequality reduces to a hard
thresholding policy:

$$|C_i| > \gamma_i (\log K)^{1/\beta_i}. \tag{8}$$

Furthermore, the resulting minimal code length is

$$\mathcal{L}(C_1, \ldots, C_K) = \sum_{i=1}^{K} \left( \min\left( \left|\frac{C_i}{\gamma_i}\right|^{\beta_i}, \log K \right) - \log\left(\frac{\beta_i}{2\gamma_i\Gamma(1/\beta_i)}\right) \right). \tag{9}$$

This provides an interesting criterion for best basis search of
signals embedded in (possibly non-Gaussian) noise.
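
The hard thresholding rule and the code length of this subsection can
be sketched in a few lines of Python. The sketch assumes the
exponential-power potential $\varphi(u) = |u/\gamma|^{\beta} -
\log(\beta / (2\gamma\Gamma(1/\beta)))$ with the same $(\beta, \gamma)$
for every coefficient and a penalty of $\log K$ per retained
coefficient; the function names and the sample coefficients are
hypothetical. A brute-force search over all supports is included only
as a sanity check on small inputs.

```python
import itertools
import math

def log_norm(gamma, beta):
    """Log of the exponential-power normalizing constant beta / (2 gamma Gamma(1/beta))."""
    return math.log(beta / (2.0 * gamma * math.gamma(1.0 / beta)))

def potential(u, gamma, beta):
    """Exponential-power potential: minus the log-density evaluated at u."""
    return abs(u / gamma) ** beta - log_norm(gamma, beta)

def mdl_threshold(coeffs, gamma, beta):
    """Return the hard-thresholded coefficients, the threshold and the code length."""
    log_k = math.log(len(coeffs))
    thr = gamma * log_k ** (1.0 / beta)
    kept = [c if abs(c) > thr else 0.0 for c in coeffs]
    length = sum(min(abs(c / gamma) ** beta, log_k) - log_norm(gamma, beta)
                 for c in coeffs)
    return kept, thr, length

def description_length(coeffs, support, gamma, beta):
    """Penalized criterion when the estimate equals C_i on `support` and 0 elsewhere."""
    fit = sum(potential(0.0 if i in support else c, gamma, beta)
              for i, c in enumerate(coeffs))
    return fit + len(support) * math.log(len(coeffs))

# Gaussian noise with standard deviation sigma corresponds to beta = 2, gamma = sigma * sqrt(2).
sigma = 0.5
gamma, beta = sigma * math.sqrt(2.0), 2.0
coeffs = [5.0, -0.2, 0.1, 3.0, 0.05, -0.3, 0.0, 0.4]  # hypothetical coefficients

kept, thr, length = mdl_threshold(coeffs, gamma, beta)

# Exhaustive minimization over all 2^K supports (feasible only for small K).
best = min(description_length(coeffs, set(sub), gamma, beta)
           for r in range(len(coeffs) + 1)
           for sub in itertools.combinations(range(len(coeffs)), r))

print(kept, thr, length, best)
```

In the Gaussian case the threshold reduces to $\sigma\sqrt{2\log K}$,
and the exhaustive search returns the same description length as the
closed-form expression, since the criterion separates across
coefficients.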

4.2.  Robust  Representation 

While the assumption that all the statistical characteristics of the
noise are known may hold in few practical cases, its analytical
tractability and appealing closed-form results have been the root
cause of its popularity. To bring us closer to practical scenarios, we
follow Huber’s approach by assuming that our noise distribution comes
from a class of distributions
$\mathcal{P}_\varepsilon = \{(1 - \varepsilon)\Phi + \varepsilon G : G \in \mathcal{F}\}$,
where $\Phi$ is the standard normal distribution, $\mathcal{F}$ is the
set of all distribution functions, and $\varepsilon \in (0, 1)$ is a
known fraction of contamination.

Prior to determining the coding length, we have to identify the model
in $\mathcal{P}_\varepsilon$ for our observed data. For a given
underlying signal whose representation has a fixed number of
components, the expected MDL is the entropy plus a constant
independent of the prevailing distribution and of the estimator. In
accordance with the minimax principle, we seek the least favorable
noise distribution and evaluate the MDL. This is tantamount to
simultaneously maximizing the entropy over $\mathcal{P}_\varepsilon$
and minimizing over the set of all estimators $S$. Interestingly, the
least favorable distribution in $\mathcal{P}_\varepsilon$ which
maximizes the entropy coincides with that which maximizes the
asymptotic variance and was derived by Huber [2]. For a normal density
with variance $\sigma^2$ we have the following result:

Proposition 2. The least favorable distribution $p_H(c)$ in
$\mathcal{P}_\varepsilon$ which maximizes the entropy is

$$p_H(c) = \begin{cases} (1 - \varepsilon)\phi(a)e^{ac + a^2}, & c \le -a \\ (1 - \varepsilon)\phi(c), & |c| \le a \\ (1 - \varepsilon)\phi(a)e^{-ac + a^2}, & a \le c \end{cases} \tag{10}$$

where $\phi$ is the standard univariate normal density and $a$ is
related to $\varepsilon$ by the equation

$$2\left(\frac{\phi(a)}{a} - \Phi(-a)\right) = \frac{\varepsilon}{1 - \varepsilon}. \tag{11}$$

The density is normal in the center and Laplacian in the tails. On the
other hand, the Maximum Likelihood estimator minimizes the entropy,
which then leads to the notion of MinMax description length.
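
A short numerical check can make the link between $a$ and
$\varepsilon$ concrete. The sketch below assumes the standard Huber
relation $2(\phi(a)/a - \Phi(-a)) = \varepsilon/(1 - \varepsilon)$ and
a density that is $(1-\varepsilon)\phi(c)$ for $|c| \le a$ and
$(1-\varepsilon)\phi(a)e^{-a|c| + a^2}$ in the tails; it computes
$\varepsilon$ from a chosen $a$ and verifies that the resulting
density integrates to one. The value $a = 1.5$, the integration range
and the helper names are illustrative assumptions.

```python
import math

def phi(c):
    """Standard normal density."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def Phi(c):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def epsilon_from_a(a):
    """Contamination fraction associated with the breakpoint a."""
    ratio = 2.0 * (phi(a) / a - Phi(-a))  # equals eps / (1 - eps)
    return ratio / (1.0 + ratio)

def huber_density(c, a):
    """Least favorable density: Gaussian center, exponential tails."""
    eps = epsilon_from_a(a)
    if abs(c) <= a:
        return (1.0 - eps) * phi(c)
    return (1.0 - eps) * phi(a) * math.exp(-a * abs(c) + a * a)

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule with n sub-intervals."""
    h = (hi - lo) / n
    inner = sum(f(lo + k * h) for k in range(1, n))
    return h * (0.5 * (f(lo) + f(hi)) + inner)

a = 1.5
eps = epsilon_from_a(a)
total = trapezoid(lambda c: huber_density(c, a), -30.0, 30.0, 60000)
print(eps, total)
```

The normalization holds because the Gaussian center contributes
$(1-\varepsilon)(2\Phi(a) - 1)$ and each exponential tail contributes
$(1-\varepsilon)\phi(a)/a$, which sum to one exactly when the relation
between $a$ and $\varepsilon$ is satisfied.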


Proposition 3. Huber’s distribution $p_H$ together with the MLE based
upon it, $\hat{\theta}_H$, result in a minimax MDL, i.e. they satisfy
a saddle-point condition.

Using exactly the same approach as for the Gaussian distribution, the
minimax description length leads to the following thresholding rule:

Case 1  When $\log K > a^2/(2\sigma^2)$, the coefficient estimate is
set to zero when

$$\frac{1}{\sigma^2}\left(-a|C_i| + \frac{a^2}{2}\right) + \log K > 0 \tag{12}$$

which implies that

$$|C_i| < \frac{a}{2} + \frac{\sigma^2}{a}\log K. \tag{13}$$

Case 2  When $\log K \le a^2/(2\sigma^2)$, the coefficient estimate is
set to zero when

$$\frac{C_i^2}{2\sigma^2} < \log K \tag{14}$$

which implies that

$$|C_i| < \sigma\sqrt{2\log K}. \tag{15}$$

This is the traditional threshold proposed by [1] and [3].
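
The two-case rule can be written as a single threshold function. In
the sketch below, the boundary between the cases is taken to be
$\log K = a^2/(2\sigma^2)$, which is the value at which the two
thresholds $a/2 + (\sigma^2/a)\log K$ and $\sigma\sqrt{2\log K}$
coincide; this boundary and the function names are assumptions of the
example rather than statements from the paper.

```python
import math

def universal_threshold(K, sigma):
    """Gaussian-case threshold sigma * sqrt(2 log K)."""
    return sigma * math.sqrt(2.0 * math.log(K))

def minimax_threshold(K, sigma, a):
    """Threshold under the Huber-type noise model with breakpoint a."""
    log_k = math.log(K)
    if log_k > a * a / (2.0 * sigma * sigma):
        # Case 1: the threshold falls in the exponential tail of the density.
        return a / 2.0 + sigma * sigma * log_k / a
    # Case 2: the threshold falls in the Gaussian center.
    return universal_threshold(K, sigma)

def denoise(coeffs, sigma, a):
    """Set to zero every coefficient whose magnitude is below the threshold."""
    thr = minimax_threshold(len(coeffs), sigma, a)
    return [c if abs(c) >= thr else 0.0 for c in coeffs]

# Example values (hypothetical): K = 1024 coefficients, unit variance, a = 1.5.
t_robust = minimax_threshold(1024, 1.0, 1.5)
t_gauss = universal_threshold(1024, 1.0)
print(t_robust, t_gauss)
```

By the arithmetic-geometric mean inequality,
$a/2 + (\sigma^2/a)\log K \ge \sigma\sqrt{2\log K}$, so the robust
threshold is never smaller than the Gaussian one; it is more
conservative whenever the heavy-tailed regime applies.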

5.  BAYESIAN  APPROACH 

The above approaches have been demonstrated to lead to good results in
relatively moderate noise scenarios and have been successfully applied
in a variety of settings. They are, however, based upon threshold
values which present two drawbacks:

•  They are directly dependent upon the noise variance without regard
to the signal characteristics.

•  They grow without bound with the data record length.

In some applications these shortcomings may greatly reduce the
performance of the aforementioned methods in retrieving the underlying
signal. Fortunately, some prior information about the signal is often
available, and it is thus natural to investigate its utility to
regularize the estimation problem at hand.

Let the probability distributions of $C^s$ and $C^n$ be denoted
respectively by $f$ and $p$, where the forms of the functions $f$ and
$p$ are assumed to be known. An estimate of $C^s$ can be obtained as
the following Maximum a Posteriori (MAP) estimate

$$\hat{C}^s = \arg\min_{C^s} \left[ -\log p(C - C^s) - \log f(C^s) \right]. \tag{16}$$


By comparing this approach with the MDL approach, we see that the
regularizing term now takes a more elaborate form, allowing us to
account for probabilistic prior information we may have about the
signal of interest. Interestingly, it can be proved that many
thresholding rules may be included within this framework [9]. For
instance, if the noise components are i.i.d. Gaussian and the signal
components are i.i.d., zero-mean and have a Laplacian distribution, a
soft thresholding policy allows us to recover the signal. The
threshold value is, however, independent of the data length $K$, as it
is equal to $\sqrt{2}\,\sigma^2/\sigma_s$, where $\sigma^2$ and
$\sigma_s^2$ denote respectively the variances of $C_i^n$ and
$C_i^s$. To better take into account the expected sparsity of the
components of the signal of interest, more appropriate priors can be
chosen. Gaussian mixtures constitute such valuable statistical models.
For example, in the presence of i.i.d. Gaussian noise, the
Bernoulli-Gaussian distribution (which is a degenerate Gaussian
mixture) leads to an estimate which is a tradeoff between a Wiener and
a thresholding estimator [6]. The estimated components then read

$$\hat{C}_i^s = \begin{cases} \dfrac{\sigma_s^2}{\sigma_s^2 + \sigma^2}\, C_i & \text{if } |C_i| > \chi_i \\ 0 & \text{otherwise} \end{cases} \tag{17}$$

where $\sigma_s^2$ is the variance of the nonzero values of $C_i^s$
and $\chi_i$ is a threshold value depending on $\sigma^2$,
$\sigma_s^2$ and the mixture parameter. The interest of this Bayesian
approach is shown in Fig. 1.
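
The Laplacian-prior case can be checked numerically. The sketch below
assumes i.i.d. Gaussian noise of variance $\sigma^2$ and a zero-mean
Laplacian prior of variance $\sigma_s^2$, for which the MAP estimate
is a soft threshold at $\sqrt{2}\,\sigma^2/\sigma_s$, and compares it
with a brute-force minimization of the negative log-posterior on a
grid. The last function illustrates a shrink-and-threshold estimator
for the Bernoulli-Gaussian case, assuming a Wiener-type gain
$\sigma_s^2/(\sigma_s^2 + \sigma^2)$ above a threshold `chi` that the
caller must supply; all names and numerical values are hypothetical.

```python
import math

def soft_threshold(c, sigma, sigma_s):
    """MAP estimate for Gaussian noise and a Laplacian signal prior."""
    thr = math.sqrt(2.0) * sigma ** 2 / sigma_s
    return math.copysign(max(abs(c) - thr, 0.0), c)

def map_objective(x, c, sigma, sigma_s):
    """Negative log-posterior for one coefficient, up to additive constants."""
    b = sigma_s / math.sqrt(2.0)  # Laplacian scale, since the variance is 2 b^2
    return (c - x) ** 2 / (2.0 * sigma ** 2) + abs(x) / b

def grid_map(c, sigma, sigma_s, lo=-5.0, hi=5.0, step=1e-3):
    """Brute-force minimizer of the objective on a regular grid."""
    n = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n + 1)]
    return min(grid, key=lambda x: map_objective(x, c, sigma, sigma_s))

def shrink_and_threshold(c, sigma, sigma_s, chi):
    """Wiener-type gain above the threshold chi, zero below it."""
    gain = sigma_s ** 2 / (sigma_s ** 2 + sigma ** 2)
    return gain * c if abs(c) > chi else 0.0

sigma, sigma_s = 1.0, 2.0
for c in (2.0, 0.5, -3.0):
    print(c, soft_threshold(c, sigma, sigma_s), grid_map(c, sigma, sigma_s))
```

Unlike the MDL thresholds, the soft threshold here depends on the
ratio of noise variance to signal spread and not on the number of
coefficients.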

An important problem when dealing with this Bayesian approach is the
estimation of the parameters of the model. Different algorithms can be
envisaged, such as the Generalized Maximum Likelihood method or
non-standard forms of the EM algorithm [5]. A fully Bayesian approach
can also be adopted where priors are introduced on the parameters and
one resorts to MCMC algorithms in order to build an ergodic Markov
chain whose equilibrium is the posterior distribution of interest [4].